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Abstract 

The effects of several nonlinear regularization techniques are discussed in the framework of 3D 
seismic tomography. Traditional, linear, £2 penalties are compared to so-called sparsity promoting 
t\ and £q penalties, and a total variation penalty. Which of these algorithms is judged optimal 
depends on the specific requirements of the scientific experiment. If the correct reproduction of 
model amplitudes is important, classical damping towards a smooth model using an £2 norm works 
almost as well as minimizing the total variation but is much more efficient. If gradients (edges of 
anomalies) should be resolved with a minimum of distortion, we prefer £\ damping of Daubechies-4 
wavelet coefficients. It has the additional advantage of yielding a noiseless reconstruction, contrary 
to simple £2 minimization ('Tikhonov regularization') which should be avoided. In some of our 
examples, the £q method produced notable artifacts. In addition we show how nonlinear £\ methods 
for finding sparse models can be competitive in speed with the widely used £2 methods, certainly 
under noisy conditions, so that there is no need to shun £\ penalizations. 

1 Introduction 

Since geophysical inverse problems are almost always underdetermined, regularization techniques are 
essential to obtain a meaningful solution. Two major classes of techniques exist. The first one, named 
'mollifying' in the mathematical literature, or 'optimally localized averaging' (OLA) in helioseismology, 
can be traced back to the groundbreaking work of Backus and Gilbert [H [2] in geophysics. In this 
approach one searches for the size of an averaging volume that can produce a local average of the 
model parameter with an acceptable variance. Since this method is computationally very expensive, 
it has found little application in large-scale geophysical inversions such as seismic tomography. To 
limit the computational effort, seismic tomographers instead search for a biased ('damped') solution. 
This has the disadvantage of introducing a systematic error - the bias - in lieu of the random error 
caused by the propagation of data errors. It can be turned into an advantage if the bias actually 
reflects a justified disposition of the scientist to prefer certain models over others, as long as the data 
are fit within their error bars. 

Simple ^2-norm damping, which biases model perturbations towards zero in the absence of in- 
formation based on the data, is generally a bad choice to regularize the inverse problem for seismic 
tomography as it tends to introduce structures reflecting ray coverage into the images. For that reason, 
most tomographers prefer to bias the model towards 'smooth' anomalies, in effect trying to forge a 
compromise between Backus- Gilbert theory and the efficiency of damped inversions. The smoothness 
of the images has the advantage that large structures become easily visible. Sharp discontinuities, 
however, are blurred, and smaller structures, even when resolved, may be diluted beyond recognition. 



Recently, [3] — hereafter referred to as Paper I — introduced a third option for the bias in geophysical 
inversions: to minimize the £i-norm of the wavelet decomposition. This also biases the model towards 
zero, but it turns out that such reconstructions always have many (or most) wavelet coefficients exactly 
equal to zero (i.e. they are sparse). In a synthetic 2D experiment using surface wave data, we showed 
how structurally coherent features (in a geophysical sense) , were more faithfully reproduced using this 
technique than with a simple ^-norm damping. In addition, as a result of their inherent sparsity, l\ 
reconstructions exhibit much less noise than their £2 counterparts. 

Though Paper I clearly showed the feasibility of wavelet-based £\ regularization it left a number of 
questions unanswered, in particular which wavelet families work best, how they would perform against 
more sophisticated £2 norms (e.g. smoothness damping) and whether the computational feasibility as 
well as the positive conclusions for wavelet regularization do scale up to large, 3D models. 

In this paper we therefore aim to refine the original conclusions of Paper I and to enlarge the scope 
of the investigation. We extend tests to 3D inversions of body wave travel times and investigate the 
use of different families of wavelets (Haar, D4, dual tree). We include a comparison with smoothness 
damping, and with a fourth option named 'total variation' damping. Unlike Paper I, we also discuss 
so-called £q constrained recovery. In contrast to the £\ norm technique which relies on iterative soft- 
thresholding, the £q recovery method uses iterative /lard-thresholding of wavelet coefficients [H [5] . 

The (salt dome) model that we try to reconstruct here is more realistic than the model in Paper I 
and includes a wide range of length scales. The problem described in Paper I was also of a very limited 
size: there were only about 10 4 degrees of freedom in the reconstructed models. Here we perform 3D 
reconstructions, and increase the number of degrees of freedom by an order of magnitude to about 
~ 10 5 . The number of data also increases accordingly to 24000. Our approach is complementary 
to that of [6] who expand the Frechet kernels into wavelets to obtain a significant reduction in the 
memory requirements to store the kernel. 

One disadvantage of the £i-norm, £q and total variation penalties is that they lack the convenient 
linearity of the more conventional ^2-norm minimizations. Making use of recent algorithmic improve- 
ments, we do demonstrate that finding a (nonlinear) sparse model reconstruction is not necessarily 
more expensive, computation-wise, than £2 based (linear) reconstructions. 

The use of £\ norms in seismic tomography was to the best of our knowledge first proposed in [7] 
in the form of an iteratively reweighted least squares method (IRLS, see also [8]), but has never found 
much favor, possibly because the convergence of IRLS was not guaranteed. Besides its use in seismic 
tomography £\ norms have found application in other geophysical contexts such as deconvolution and 
interpolation [9l 1101 fTT|. I12j. The use of £\ norms in combination with a carefully chosen basis (such 
as wavelets) is, however, more recent and is largely inspired by the recent development of compressed 
sensing \Y'6\ [T4"l [To] , that shows that under certain conditions an exact reconstruction can be obtained 
by solving an £\ regularized inverse problem, provided there is an underlying basis in which the desired 
model is sparse. This emphasizes the importance of studying different "dictionaries" as we do here. 
Recently [16] has successfully applied the compressed sensing idea to wavefield reconstruction, albeit on 
small-scale problems only. The success and promise of compressed sensing has therefore also increased 
interest in the speed-up of such £\ problems to be able to handle practical geophysical applications 
(e.g. HUES]). 

2 Forward problem formulation 

We plan to test the regularization methods on a synthetic data set generated for a salt dome model. 
Since the main goal of this paper is to evaluate and compare a number of algorithms, numerical 
efficiency is more important than the wish to have a tomographic problem at hand that is fully 
realistic. We have thus taken a few shortcuts to be able to run inversions quickly in Matlab on a single 
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processor. However, we took pains to ensure that we would invert for a model that has a large range 
of length scales, and that the ray coverage would encompass both dense and sparse regions. 

In dimensionless variables, the expression for a finite frequency sensitivity kernel corresponding to 
a constant background and a Gaussian power spectrum is given by the formula: 



where u = n(d s + d T — d ST )/X. Here d sr is the distance between source (earthquake) and receiver 
(station), and d B , d r are the distances to source and to receiver, measured from the point (x, y, z). A is 
the dominant wavelength and H$(u) = 120u — 160u 3 + 32u 5 denotes the Hermite polynomial of order 
5. Equation ([1]) can be derived from the expressions for the Frechet kernel in a homogeneous medium 
using an analytical expression for the spectral integration (see [19])- 

The constant background model that we use here is not at all realistic from a physical point of 
view for the salt dome input model that we will use in section 21 in case of of large velocity contrasts 
kernels are bent rather than straight as in expression ([T]). However, as we will use the same constant 
background kernels (pQ) for generating synthetic data as well as for reconstructing models from these 
data, we believe it is possible to accurately evaluate the effects of the regularization technique on the 
inversions. Obviously, reconstructing a model from actual measurements requires kernels of a more 
complicated shape than ([1]) but their resolving power is not fundamentally different from those used 
in our simple application. 

For each source-receiver pair and each dominant wavelength, the travel time differential and the 
model perturbation m are connected by the linear integral relation [200 



Given sufficiently many data, the aim of seismic tomography is to reconstruct the model m from a 
noisy version of the data vector d containing many travel time differentials corresponding to many 
source-receiver pairs. 

In section U] we will perform a number of seismic reconstructions. The domain on which we will do 
this is the cube V = [—1, l] 3 . For discretization, this domain is subdivided in 64 3 voxels, a convenient 
choice for the digital reconstruction of a model m, leading to 262144 degrees of freedom in m. In 
order to be able to produce meaningful reconstructions, we expect to need at least 10 data (about 
1 datum for 10 degrees of freedom). Hence we will choose 4800 sources-receiver pairs and 5 different 
dominant wavelengths so as to yield 24000 data. This represents an overparametrization by a factor 
of more than 10. Thus regularization will be an essential requirement for any data inversion. 

A very efficient set-up of our numerical experiment was obtained as follows: we first choose 100 
source-receiver pairs in random positions on the surface of the cube, while making sure source and 
receiver are never on the same face (the kernels (pQ) are not curved and would not be able to cross 
the model domain very much if source and receiver were on the same face). From these initial 100 
pairs, we construct the full set of 4800 pairs by using the 48 symmetry transformations of the cube. 
These 48 operations are constructed from the 3! permutations of the coordinates (x,y,z), and by 
the 2 3 reflections (±x, ±y, ±z), as listed in Table [TJ In other words, starting from one initial kernel 
K(x, y, z) we easily obtain 47 other kernels: K(z, x, y), K(y, x, —z), . . . , corresponding to differently 
positioned source-receiver pairs. The random nature of the positions of the original 100 source-receiver 
pairs (i.e. no coordinate is exactly zero or exactly equal to plus or minus another coordinate) ensures 
us that none of the 4800 source-receiver pairs are identical. The initial 100 source-receiver pairs and 
the final 4800 pairs are shown in Figure [TJ 
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Permutations 



x,y,z) (x,z,y) (y,x,z) 

x,y,-z) (x,z,-y) (y,x,-z) 

x,-y,z) (x,-z,y) (y,-x,z) 

x,-y,-z) (x,-z,-y) (y,-x,-z) 

-x,y,z) (-x,z,y) (-y,x,z) 

-x,y,-z) (-x,z,-y) (-y,x,-z) 

-x,-y,z) (-x,-z,y) (-y,-x,z) 

-x,-y,-z) (-x,-z,-y) (-y, -i 



{y,z,x) 
(y,z,-x) 

(y,-z,x) 

(y,-z,-x) 

{-y,z,x) 

{-y,z,-x) 
(-y,-z,x) 

z) {-y,-z,-x) 



z,x,y) 

z,x,-y) 

z,-x,y) 

z,-x,-y) 

-z,x,y) 

-z,x,-y) 

-z,-x,y) 

-z,-x,-y) 



z,y,x) 

z,y,-x) 

z,-y,x) 

z,-y,-x) 

-z,y,x) 

-z,y,-x) 

-z,-y,x) 

-z,-y,-x) 



Table 1: Left: A list of rotations and reflections that map the unit cube [—1, l] 3 onto itself. They are 
constructed by combining the six permutations of (x, y, z) and eight sign changes of the components 
(x,y,z). 




Figure 1: Left: The cube [— 1,1] 3 with the initial 100 source-receiver pairs (black=source, 
red=receiver). Center: The 4800 source-receiver pairs one obtains by applying the 48 symmetry 
transformations of Table [1] on the initial 100 pairs. Right: A single source-receiver pair joined by a 
straight line and its 48 transformations. Plotting all 4800 rays obtained in this way would fill the 
whole cube. 
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Figure 2: Cross sections (perpendicular to the midpoint of the central ray) of five finite sensitivity 
kernels corresponding to A 6 {0.5, 0.2, 0.08, 0.04, 0.025}. Larger wavelength kernels are wider but have 
smaller amplitude. 



For each of the 4800 source-receiver pairs we will construct five finite frequency sensitivity kernels 
of type dU corresponding to five different dominant wavelengths A = {0.5, 0.2, 0.08, 0.04, 0.025}. Thus, 
in total there will be 24000 kernels at our disposal. Because of the symmetry transformations used 
and the random choice of the initial 100 source-receiver pairs, the coverage of the domain by these 
24000 kernels is quite uniform. A picture that illustrates this property is too large to include here, 
but it is available in the online supplementary material in Figure A.l. 

In order to give the reader an idea of the size of the Fresnel zones of these finite frequency kernels, 
we include in Figure [2] and in one of the panels of Figures [6] and [7] a cross-sectional view of five kernels. 
Each of these kernels corresponds to one of the five wavelengths (A G {0.5,0.2,0.08,0.04,0.025}) and 
source-receiver distance d sr = 2. The width of the Fresnel zone is proportional to \/\d SI |21| . The 
cross sections in Fig. [2] illustrates not only the typical widths of the kernels, but also their relative 
amplitudes. 

Additionally we construct a second operator containing only 20000 out of the 24000 kernels. We 
choose to remove the 4000 kernels for which the line connecting source and receiver (i.e. the central 
ray) comes closest to the point (0.24, —0.7, —0.23). In this way, we end up with an operator that has a 
'hole' in its coverage of the cube (see Figure A.l, right). Using this operator we will be able to study 
the effect of non- uniform coverage on reconstructions. 

To discretize the model into voxels we calculate the integral of the sensitivity kernels over each 
voxel (using a Riemann sum with 4 3 terms/voxel) for each source-receiver pair and for each of the five 
dominant wavelengths we consider. The resulting values make up the operator A we want to invert: 
Because there are 24000 kernels and 64 3 = 262144 voxels, the matrix A will have 24000 rows and 
262144 columns. 

The use of the 48 symmetry transformations of the cube allows us to save a factor of 48 in memory 
requirements for our calculations, i.e. we need to compute and store only 500 kernels corresponding 
to the initial 100 source-receiver pairs and 5 wavelengths. The remaining ones are easily (and quickly) 
generated from these 500 kernels by the symmetry transformations of the cube (quick rearrangements 
of the elements in a 3D array). 

1 For convenience, we denote the model perturbation by m instead of 5m as is often done. 
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An additional saving in memory requirements is obtained by exploiting the fact that most kernels 
are well localized (i.e. they are thin), and thus that the Aij are practically zero for many voxels. In 
other words, each row of the matrix A is relatively sparse. 

Although this set-up is completely unrealistic from a physical perspective, this configuration of 
rays provides for an easy way to compare dense and partial data coverage of the model. It also allows 
us to focus on the relative merits of the inversion methods rather than on the difficulties a physically 
faithful modeling would entail. 

In reality, sensor coverage is limited to the surface and a few boreholes. High frequency data 
with narrow Fresnel zones therefore leave significant areas at depth not illuminated by acoustic waves. 
Low frequency data have wider Fresnel zones but suffer a reduced sensitivity to small length scales. 
Illumination itself does not guarantee resolution as becomes clear when one regards the singular value 
spectrum of A: though the spectrum depends strongly on the experimental set-up, it often has a 
rapid drop-off and a significant fraction of eigenvalues is either zero or too small to be useful. In the 
present synthetic experiment, we chose to have about 1 datum for every 10 degrees of freedom in the 
model. The uniform random distribution of sources and receivers on the surface of the cube has a 
favorable influence on the singular value spectrum of A (we were able to confirm this on a scaled- 
downed version of the present operator A, but not on the full 24000 x 262144 matrix), and will aid in 
the reconstruction . This will allow us to focus on the characteristics of the different reconstruction 
methods, rather than on the lack of data. 

3 Reconstruction methods 

Reconstructing the model m from the data d is done, in principle, by solving the linear system 
Am = d, where A denotes, as before, the matrix containing the kernels discretized in the voxel 
basis. This system may contain incompatible equations (due to noise), and at the same time be 
underdetermined (not enough data to reconstruct all of m). 

The problem of incompatible data can be solved by replacing the original problem with the mini- 
mization of a data fidelity term: 

m = argmin || Am — d|| 2 . (3) 

m 

Here and in the following ||u|| (without subscripts) always denotes the usual 2- norm of u: ||u|| = 
J52i u l Although a minimizer always exists (because of the quadratic nature of the functional), it 
may not always be unique. In other words, the problem is still underdetermined and an iterative 
numerical scheme for finding a minimizer of ([3]) may diverge. In fact, because of the existence of data 
errors we are not even looking for the exact minimizer ([3]). We rather augment the functional in ([3]) 
by a term that will penalize a whole category of models that is thought to be unphysical. 

3.1 £2 penalties 

The prime example of this kind of method is Tikhonov regularization [22] whereby a penalty propor- 
tional to the ^2-norm of the model is imposed: 

m = argmin ||Am — d|| 2 + /i||m|| 2 . (4) 

m 

This will effectively prevent the model from growing unboundedly due to noise in d and ill-conditioning 
of the matrix A. 

Another, closely related, possibility is to impose a penalty consisting of the ^-norm of the (discrete) 
Laplacian Am of m: 

m = argmin ||Am — d|| 2 + /x||Am|| 2 . (5) 
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As we will see in section 01 this will enforce a certain degree of smoothness on the reconstructed model. 
The Laplacian used here is defined as the difference of the model with the local average over the six 
nearest neighbors: (Am)^ = m ijk - (mj_i jA + m i+ljk + . . . + m ijk+ i)/6. 

The above two versions of the Tikhonov regularization method have the advantage of being solvable 
by linear equations. The variational equations that determine the minimizers of and ([5]) are: 

A T Am + [i m = A T d , (6) 

and 

A T Am + n A T Am = A T d , (7) 

respectively (with suitable treatment of boundary voxels, A = A in the latter case). For these 
linear equations, we can use the conjugate gradient algorithm. With =arbitrary and = 

A T (d - AmW) - /iD r Dm("), we set 

f rW n = 

v [n) _ I || r (n)||2 , „ 

L ^ Hrf"- 1 )!! 2 * (g) 

llv(«)ll 2 

m (n+l) = m (n) rT JP— | — vW, 

||AvW|| 2 + /i||Dv(™)|| 2 

where D is either the unit matrix I (in case we seek to minimize functional or A (in case we 
use functional ©). The model estimates m( n ) converge to the minimizer Q or ([5]), respectively as n 
increases. 



3.2 li penalties 

Another and much more recent method of regularization consists of imposing a carefully chosen l\- 
norm penalty |23j . It can be shown that this leads to a sparse model, i.e. a model with few nonzero 
components [21 [15]. It would therefore not be a good idea to apply this technique to the model in the 
voxel basis (there is no reason to assume the model would be sparse in that basis); we would rather 
use this penalty on the coefficients of the model in a different basis in which we believe the model to 
be sparse. 

Harmonic functions would allow us to select resolvable scales, but the complete lack of localization 
of these functions makes them even worse candidates than voxels. 3D wavelets offer a compromise 
between the concentration of power in both scale and location, and are intuitively more suitable to 
build geophysically reasonable models. In fact, our earlier experience (Paper I) showed the advantages 
of using a wavelet basis, and constructing the model by finding the minimum of the functional 

||Am-d|| 2 + 2/i||w||i, (9) 

where w = Wm are the wavelet coefficients of m. This minimization problem can be rewritten as: 

m = W _1 w and w = argmini ? (w) with F(w) = II AW _1 w — dll 2 + 2//||w||i. (10) 

w 

W is the wavelet decomposition matrix and W _1 the wavelet synthesis operator. This type of l\- 
norm penalty leads to a model that has a sparse wavelet representation, i.e. a model with (very) few 
nonzero wavelet coefficients. The aim is thus to rely on the properties of the wavelet basis to be able 
to represent the desired solution with few nonzero components. In geophysics wavelets are a good 
choice for seismic reconstruction as they allow for sparse representations of overall smooth functions, 
while still capable of taking into account the possibility of isolated sharp features [24|, [3] . 
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Another advantage of the i\ method is that they yield noise-free model reconstructions. This is a 
consequence of the simple fact that noise in the model cannot be represented in a sparse way (in any 
reasonable basis). This method is thus able to produce clean models without the need for additional 
smoothing. It is important to note that the least squares functional fjlUj) is convex. This implies that 
a local minimum of (|10p is always a global minimum as well. 

In section 01 we shall consider a number of different choices of orthonormal wavelet bases. For 
each of these choices, we have W _1 = W T , which we will implicitly assume hereafter. 

In order to find the minimizer w of the t\ penalized functional ()10p . one may use the iterative 
soft-thresholding algorithm [23]: 

w (n+l) = T ( W W) 5 (11) 

with 

T(w) = S aii [w + aWA T (d - AW T w)] , (12) 

and where w^ ) may be chosen arbitrarily. The soft-thresholding S T {u) function operates component- 
wise and is defined by 

{U — T U > T 

\u\ < T (13) 

u + r u < —t. 

This algorithm was used in a 2D seismic tomography toy problem in Paper I, to which we also 
refer for a brief but elementary derivation (section 2 of paper I). In effect, it is a simple gradient 
descent algorithm (with fixed step length a) where the additional soft-thresholding operation S afl is 
a mathematical consequence of the £i-norm term present in functional (|10p . 

The constant a should be chosen such that a||A T A|| is smaller than or equal to 1 (||A T A|| is 
defined as the largest eigenvalue of A T A)[23]. In our calculation we always choose a = 0.95/\\A T A\\. 

The wavelet transform W and its inverse ~W T are fast transforms. This means that they cost only a 
fraction of the computer time needed to perform one application of A or A T . In other words, working 
in a wavelet basis does not significantly change the computational complexity of the reconstruction 
algorithm. 

As the iterative soft-thresholding algorithm (jlip can be slow in practice, we have opted here for 
using the so-called Fast Iterative Soft-Thresholding Algorithm (FISTA) [25] (see also earlier work of 
Nesterov HUE?]): 

w(" +1 ) = T (w(«) + (wW - w^)) , (14) 

with to = 1 and t n+ \ = (1 + + 4i 2 )/2|l The FISTA algorithm has practically the same compu- 
tational complexity as the iterative soft-thresholding algorithm (|lip . It only requires one additional 
vector addition. With this algorithm the i\ penalized cost function (jlOp . evaluated at w = w^, is 
bounded by 0(l/n 2 ) from its limiting value: 

i ^ llw(°) - wll 2 

F wW - F w < 4 11 1 , (15) 

a (n + If 

as opposed to the C(l/n) decrease: 

< ^ llw(°) - wll 2 
F(w^) - F(w) < L (16) 

a n 

that can be proven for algorithm (|lip . These upper bounds are valid non-asymptotically, i.e. even for 
small n 



2 In the journal published version of this manuscript this formula contains a typo. 
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3.3 £ penalties 

Recently, mathematical advances on direct ways of constraining the number of nonzero components 
in a reconstructed model have appeared. In [H [5] an iterative hard-thresholding algorithm is proposed 
of the following form: 

w (n+l) = f( w (n)) 5 ( 17 ) 

with 

f (w) = H k [w + aWA T (d - AW T w)] (18) 

and where the hard-thresholding operation %fc(u) sets all but the largest (in absolute value) k com- 
ponents of u to zero. This algorithm converges to a local minimum of 

|| AW _1 w — d|| 2 under the condition || w ||o < k. (19) 

Here ||w||o denotes the number of nonzero coefficients of w. Just as with the l\ penalty method, we 
shall apply this technique using a wavelet basis. The underlying reason is again the suitability of a 
wavelet basis to represent a physically acceptable model with few nonzero wavelet coefficients. We 
stress that the appearance of the hard-thresholding operation Tik is a mathematical consequence of 
the constraint ||w||o < k, just as soft-thresholding in (fT2|) is a mathematical consequence of the l\ 
term in (fTO]) [2H | 13] . 

The algorithm (|17p converges very slowly. We therefore propose the following FISTA /Nesterov-like 
modification: 

w(" +1 ) = f |w(") + (wW - wC"" 1 ))) , (20) 

with to = 1 an d = (1 + \/l + 4t 2 ) /2H As far as the authors know, this is the first time this 
algorithm is proposed. Although there is no proof of convergence yet, we found that it worked quite 
well on the examples that we studied (see section 2]) . We used the same choice for the step-length 
a as with the t\ algorithm: a = 0.95/|| A T A||. The choice of the number of nonzero model wavelet 
coefficients k in method (|19p is discussed in section 13.51 

On a side note, it would also be possible to calculate a local minimizer of the (non-convex) func- 
tional 

||AW _1 w-d|| 2 + /x||w|| (21) 

using component-wise hard-thresholding (with a fixed threshold ^/Jl) [3J. However, [5] seems to prefer 
the formulation (|19j) that imposes a fixed number of nonzeros in each step of the iteration, rather than 
a fixed lower bound ^fji for the absolute values of the wavelet coefficients. 

Although no proof of convergence of algorithm (|20p is given, we shall refer to it by the name 1 £q 
method' (instead of calling it 'hard-thresholded Nesterov accelerated gradient descent algorithm'). 

3.4 Total variation penalty 

A final penalty term we will consider is the so-called 'total variation' (TV) penalty: 



m = argmin ||Am - d|| 2 + 2/x ^ J {A x m) 2 + (A y m) 2 + (A z m) 2 , (22) 

ijk 

with A x m = niijk — mj_ij/% and similarly for A y and A z . This penalization will favor piece- 
wise constant models in the voxel basis. Unfortunately, the equations that determine the mini- 
mizer (|22p are again nonlinear. We will use a reweighed conjugate gradient method |29] to de- 
termine the minimum of the TV functional (|22p . More specifically, defining the weights u( n ) = 



3 In the journal published version of this manuscript this formula contains a typo. 
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, we shall use algorithm ([8]) where we choose D T D = A^u^A x + 
A^u^Aj, + A^u^A z (which depends on the iteration n) and use that ||Dv( n )|| = (D T DvW,vW). 
Because of the non-quadratic nature of functional (|22p . the conjugate gradient algorithm no longer 
preserves conjugacy between successive search directions as n grows. For this reason, the iteration 
is also reset every so often (in accordance with |29j). 

An algorithm similar to (|lip and (|14p may also be used to find the minimizer of the TV penalized 
functional (|22p [301 131J . In fact, in [30] it is shown how the t\ norm penalty and the TV penalty can 
be combined in a single functional. 

3.5 Choice of the penalty parameter 

As such, the minimizers defined by (|3J), ([5]), ([TO]) and ([22]) still depend on the penalty parameter fj,, or 
in case of the hard-thresholding algorithm (|20p on the parameter k. In the reconstructions below, we 
select this parameter by requiring in each instance that the reconstructed model rh fits the data d as 
well as possible, but not any better than the noise level: 

||Am - d|| « ||n||, (23) 

with n representing the noise vector. In other words, the discrepancy principle tells us to choose the 
penalty parameter \i (or k) such that 

1 1 Am — d|| 2 /o" 2 ~ number of data, (24) 

where a is the noise variance (if different data have different variance, it is simplest to divide from the 
outset each row as well as the right hand side by the standard deviation of its datum, and set a = 1 
in (|24p ). In practice, this means that we will have to perform the minimization several times, until a 
suitable value of /i or k is determined for each reconstruction. 



4 Reconstructions 

In this section we present some sample reconstructions using the algorithms mentioned in section [3l 
applied to the finite frequency tomography problem described in section [2j First we will consider a 
simple checkerboard input model that we will try to reconstruct from incomplete and noisy data. We 
will also look at a more complicated salt dome model which we obtained from BP America, Inc. 

In each case, our procedure will be the following. We start from a known input model m mput from 
which we construct synthetic noisy data d: 

d = Am input + n. (25) 

The noise n is taken from a Gaussian distribution, with zero mean and variance a chosen in such a 
way that ||n||/|| Am input || = 0.1; in other words, we add 10% noise to the noiseless data. The goal is 
then to try to reconstruct m input as well as possible from d and A. For this we will use methods ([3]), 
©) pop . ([19p and (|22|). and compare the results. Since we know the noise variance <r, we can use the 
criterion (I24p to choose the penalty parameter /i or k. This parameter will be different for the various 
synthetic data and for the various penalties that we impose. 

For the wavelet based methods (fT0|) and (fT9j) . we also need to choose a specific wavelet family. 
There are many wavelet bases, with varying degrees of smoothness and approximation properties [24] . 
This gives us the opportunity to adapt our choice of wavelet basis to the model: we will choose the 
basis in which we suspect the model to be sparse. In particular, for the checkerboard reconstruction 
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we will use Haar wavelets because we know that the true solution is very sparse in that basis. We thus 
expect a very accurate reconstruction in that case. For the salt dome input model, we will compare 
the effects of different choices of wavelets bases on the reconstruction. Our choice will include Haar 
wavelets, D4 wavelets and also directional dual tree wavelets. Haar wavelets are the least smooth, and 
directional wavelets are the most smooth of these three. 

4.1 Checkerboard 

The first example consists of a checkerboard pattern; in other words the input model m input is piecewise 
constant (±1) on small cubes of 8 by 8 by 8 voxels. It mainly serves as a proof of principle for the £\ 
wavelet method ()10p because we know that this input model is sparse in the Haar wavelet basis |32j. 
The Haar wavelets are piecewise constant. In fact, the model only has 99 nonzero coefficients in that 
basis (out of 262144). Hence, we may expect that the l\ method will work very well with this model 
and this basis. 

A single horizontal slice of the checkerboard input model and its four reconstructions are shown in 
Figure O These four reconstructions use all 24000 kernels so that no particular region in the model is 
favored or disadvantaged. The i\ reconstruction (|10p with Haar wavelets is visually the most faithful 
to the original, closely followed by the £o~Haar reconstruction. The ^-reconstructions (|4l5p and the 
TV reconstruction (|22p display smooth transitions between +1 and —1. The reconstruction result of 
the simple £2 method is quite noisy, which is not the case for the other methods. For completeness 
and viewing convenience, the online supplementary material includes a picture (Figure A. 2) of all the 
horizontal slices of the input model and its various reconstructions. 

The smoothing effect of the reconstructions can quantitatively be seen from the histogram of the 
reconstructed model amplitudes (see Figure 0]) . The £2 reconstruction takes on mostly values around 
zero, whereas the input model only has amplitudes +1 and —1 (vertical blue lines). In this case, 
the l\ Haar reconstruction does a very good job at recovering the ±1 amplitude distribution, as it 
is naturally well suited for the particular checkerboard model used. The £q method does second best 
and the TV method (|22p does third best from this point of view. It is surprising that the £\ method 
outperforms the £q method in this case. 

The checkerboard model shows that good reconstructions are possible if one has very good prior 
knowledge on the model. When using the £q and l\ methods this requires a basis in which the desired 
model is very sparse. The checkerboard model that was used here aligns optimally with the chosen 
Haar basis. The results of £\ and £q reconstructions deteriorate when the checkerboard pattern is 
shifted w.r.t the Haar basis or when the size of the fields are changed (the Haar basis decomposition 
of such a checkerboard would seize to be sparse). As such, the checkerboard model we choose is very 
particular. For realistic reconstruction scenarios one should not expect such excellent results. 

A discussion of the mean square error of the various reconstructions is given in section 14. 3\ where 
other computational aspects are also discussed. 

4.2 A 3D salt dome model 

In this section we try to reconstruct a complex 3D model of a realistic salt body in the subsurface. The 
complex salt dome model was obtained from a prestack depth-migration of field seismic data in the 
deep-water part of the Gulf of Mexico, and was kindly provided to us by BP America, Inc. To better 
accommodate the straight-ray tomography used in this paper, the sediment velocities surrounding 
the salt dome model that were present in the original model provided to us, were replaced with 
a constant velocity. This model was superimposed on a background model with long-wavelength 
variations (smoothed Gaussian). The model has a rather sharp contrast between the velocity in 
the salt and in the surrounding background model, providing for sharp edges. A single horizontal 
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Figure 3: A single horizontal slice (number 25 (from top) out of 64) of the checkerboard model and its 
reconstructions using all 24000 data. The whitish transitions in several of the reconstructions are the 
result of smoothing between +1 and —1 of the input model. As the input checkerboard has fields of 
size 8x8x8 with constant model value ±1, the smoothing occurs with period 8 as well. In addition, 
the £2 reconstruction has a distinctive noisy appearance. The i\ reconstruction using Haar wavelets 
is visually indistinguishable from the input model. The £q method does almost as well. 
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Figure 4: Histograms of the amplitude distribution of the five checkerboard reconstructions. The 
input model takes on values +1 and —1 only. The l\ reconstruction in the Haar basis, and to a lesser 
extent the ^crHaar reconstruction, result in almost perfect reconstruction of the amplitude distribution 
whereas the amplitudes are shifted towards zero by the other reconstruction methods. 



13 



m = 0.9 



m = 1.0 



m = 1.1 




Figure 5: Three contour plots, i.e. surfaces of constant model value, of the salt dome model (m = 
0.9,1.0,1.1). 

slice through the resulting salt dome model is pictured in Figures and [7J For completeness, all 64 
horizontal slices are shown in Figure A. 3 (right) of the online supplementary material (as well as the 
smoothed Gaussian that was added in; Figure A. 3, center). Three contour plots of this model are 
shown in Figure EH corresponding to the model values m = 0.9, m = 1 and m = 1.1. The model 
contains 64 x 64 x 64 = 262144 voxels as in the checkerboard examples. 

We perform the same type of experiment as before: we construct synthetic data and add 10% 
Gaussian noise to it. From this noisy data, we try to reconstruct the input model. There are two 
differences with the checkerboard reconstructions. Firstly, we will compare several different wavelet 
families for the l\ and £q reconstructions (in this case, there is no obvious reason to prefer Haar 
wavelets over other wavelet bases). Secondly, we will repeat the reconstruction experiment for an 
operator that has only 20000 kernels instead of 24000, to simulate imperfect coverage of the model 
domain by the kernels. In other words, with the 20000 kernel reconstruction, a region of the model is 
ill resolved. 

The wavelet families used are, in order of increasing smoothness: Haar [32 j . D4 |33[ I34j and so- 
called directional dual tree (DT) wavelets |35[ [36] . The Haar and D4 wavelet transform on the cube 
are direct products of the corresponding wavelet transforms in ID. The DT wavelet transform is not 
and it has, by construction, better directional sensitivity. The D4 wavelets that we will use do not 
suffer from edge effects as they do not use periodic boundary conditions, but follow the interval scheme 
proposed in [331 El]- Other model parameterizations that could be used are shearlets or curvelets (they 
are particularly suited to sparsely represent models with singularities along curves or surfaces), but 
we did not include them in our study [371 EB1 G2] ■ 

Judging the success of an algorithm to reconstitute the input model invariably involves a degree 
of subjectiveness, even if one designs a numerical measure for goodness of model fit. Such measure 
might also depend on the goal of the scientific experiment conducted. For example, if one deducts 
temperatures from velocity variations, it is more important that the amplitudes are correct and less 
important that sharp edges of an anomaly are preserved, but a structural geologist may be more 
interested in the edges and may wish to involve the misfit of the gradient, for example. 

In Table [21 we list the amplitude misfit (||m — m input ||/||m mput ||) and judge the fit to other features 
visually. A single horizontal slice (number 25) of the different reconstructions is pictured in Figure 
[6] and all 64 horizontal slices are shown in Figure A. 4 in the online supplementary material. For the 
reconstructions using all 24000 kernels, the TV method works best based on the final resulting error, as 
well as visual inspection. It is closely followed by the l\ method using dual tree wavelets (f^DT) and 
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Figure 6: Horizontal slice number 25 of the salt dome model and its various reconstructions using 
all 24000 kernels. The TV, £ 2 -A, 4-DT and £q-T)T methods introduce the most smoothing. The l\ 
and lo Haar reconstructions have the least smoothing but are very blocky. A good compromise, in 
this respect, may be found in the £i-D4 reconstruction. The £o-D4 reconstruction is less appealing. 
The bottom right figure shows five representative kernel cross sections for different frequencies. These 
cross sections are taken perpendicular to and in the middle of the central rays , and give the reader 
an idea of the size of the kernels relative to the structure in the model and of the relative Fresnel zone 
widths of the different kernels. ^ 
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Figure 7: Horizontal slice number 25 of the salt dome model and its reconstructions using only 20000 
kernels and thus non-uniform coverage. The area of the input model that lies in the region that is not 
covered by any kernel in this slice (indicated by a dashed circle), is not well reconstructed. Different 
methods compensate for the missing information in different ways. The £o~D4 reconstruction shows a 
distinctive artifact. As in Fig. the bottom right pane shows five representative kernel cross sections. 
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by the £2 method with Laplacian smoothing A). The £\ method with D4 wavelets does better than 
with Haar wavelets, that has a relative reconstruction error almost as bad as obtained using the simple 
£2 penalized method. However, the "top three" methods (TV, 4-DT, ^2-A) produce much smoother 
models than the input model. In case the correct sharpness of features is a desirable characteristic 
of the solution, the £\ reconstructions with Haar or D4 wavelets are more faithful to the input data. 
In this case one may well prefer D4 over Haar to avoid the rather blocky nature of the shapes. The 
qualitative differences with the noisy £2 reconstruction are obvious. The £q reconstructions that were 
obtained using iterative hard-thresholding, are less appealing. Numerically the £o _ H aar and £q-T>A 
perform worst of all reconstructions, whereas £o~T)T comes in fifth. 

The reconstructions with only 20000 kernels are shown in Figure [7J (a single horizontal slice) and 
Figure A. 5 in the supplementary material (all horizontal slices). The most interesting comparisons are 
again done visually. The lack of data coverage affects most strongly the areas around voxel (10,40) 
in slice 25 (Figure [7]) and the lower left corner of slices 13-44 (third and fourth row in Fig. A. 5 
especially). Not surprisingly, none of the algorithms is able to 'recreate' the model where there are 
no data at all. But close inspection of the model near the edge of the region affected by the data gap 
shows that the Haar and D4 wavelets produce the model that is least contaminated by smoothing 
effects beyond the gap, with D4 occasionally trying to correctly 'fill in'. The £q-T>A reconstruction 
creates a distinctive artifact in this area. We speculate that this is caused by the non-convex nature 
of the £q problem (fl~9j) . 

4.3 Computational aspects 

Apart from the aspect of the visual reconstruction quality it is also important to compare reconstruc- 
tion times. The four numerical algorithms that were used — conjugate gradient, fast iterative soft and 
hard thresholding, and reweighed conjugate gradient — all need one application of A and one applica- 
tion of A T per iteration step. These dominate the other, but much faster, operations such as addition 
of vectors, vector norms etc that are also present in each iteration step. The forward and inverse 
wavelet transforms that are used in some of the methods (via the Fast Wavelet Transform algorithm) 
also take a negligible time compared to an application of A and A T . It follows that it is sufficient to 
compare the number of iterations when we evaluate the efficiency of different reconstructions. 

The number of iterations and corresponding relative reconstruction errors ||m — m input ||/||m mput || 
are given in Table [2j In all cases, the iterative reconstruction algorithms were started from w^ ** = 
or m(°) = 0. 

For the checkerboard model the data in Table[2]show that the £\ and £q methods do extraordinarily 
well, with a mean square error far below what could be expected based on the data noise level of 10%. 
In the same sense, the total variation minimization and the Laplacian penalization perform somewhat 
better than the simple £2 method; a large number of simulations with different noise realizations would 
be necessary to verify whether this is a statistically significant difference. In case of the checkerboard 
reconstructions, only 100 iterations were performed for each method. This shows that the £\ and £q 
methods can be very successful if the sought after model is very sparse in the basis used, even with a 
limited number of iterations. 

We have also verified that the functionals (|3D, (0), (|10p . (|19p and (|22p remain almost constant after 
this number of iterations, as did the relative distance to the input model. 

In case of the salt dome reconstructions, the ^o~Haar, ^o _ D4 and the simple £2 method do worst 
(in terms of reconstruction error) closely followed by the £i-Haar method. The latter is due to the 
inappropriateness of the Haar basis to represent the salt dome model in a sparse fashion. The other 
three methods (£i-D4, fi-DT and TV) do about equally as the £2 — A method for the salt dome 
reconstruction with 24000 data. 
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Table 2: The number of iterations and the resulting reconstruction error for the various models and 
methods. 



To gauge wether there is a significant difference in reconstruction error between the TV, 4-DT 
and i\ — A methods (and possibly the £o-DT technique), one would also need to repeat the numerical 
experiment with many noise realizations. 

The nonlinear reconstructions with 24000 data were done with 1000 iterations. We used formula 
(|15p to derive a rough upper bound on the relative error remaining in the functional (|10j) w.r.t to the 
minimum after this number of iterations: 

i?( w W) _i?(w) 4||w(°)-w|| 2 » 

F(wW) - a(n + l) 2 F(wW) ~ ' k ; 

where we used w(°) = and approximated ||w|| by ||w^||. In other words, the calculated value of 
the minimum of the functional is accurate up to three decimal places (this bound is valid for the 
three wavelet families). A simple plot of F(w n ^) as a function of n also reveals that the functional 
is virtually constant after 1000 iterations, a conclusion which also holds for the TV method, the £q 
method and for the £2 methods (after 100 iterations). 

The corresponding reconstructions with 20000 data were done with only 100 iterations. In this 
way we demonstrate that a reasonable result can already be obtained without an excessively long 
computation time. This is evident by comparing Figures and [7] (or Figures A. 4 and A. 5 in the 
supplementary material): apart from the unresolved region near (0.24, —0.7, —0.23) the reconstructions 
are pairwise almost identical, despite the significant differences in number of iterations. In other words, 
the l\ and TV algorithms already succeed, after a small number of iterations, in producing qualitatively 
quite characteristic reconstructions. 

In case of the l§ reconstructions the differences (far from the unresolved region) are somewhat larger, 
we believe, because the £q method only finds a local minimum of ()19|) . 

As a result of the thresholding, the £0 and £\ algorithms provides sparse models at every iteration 
step (not just in the limit n — > 00). In other words it is not necessary, or desirable, to run the 
FISTA/Nesterov style algorithm for a very long time. Even after a small number of iteration, they 
will provide a sparse model that fits the data to within its error bars. 

5 Conclusions 

In Paper I we showed how a large scale anomaly could be reconstituted even where it was ill resolved 
because of the selective nature of the wavelet coefficients and the £\ criterion: one wavelet coefficient 
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reconstituting a large, circular, anomaly gave a better optimization than a couple of coefficients recon- 
stituting only the resolved part. With the results of the much more complex salt dome model at hand, 
we must now conclude that this probably represents more a (lucky) exception than a rule. There is 
no magical solution for the absence of data. 

For the checkerboard reconstructions, the l\ method with Haar wavelets is able to do very well 
— much better than could be expected based on the data themselves — because the Haar wavelets are 
very efficient in representing this particular checkerboard pattern in a sparse way. The success of the 
i\ method thus depends heavily on the choice of a suitable basis. For realistic models it is much more 
difficult to find a good — sparsifying — basis, and the reconstruction errors will be much larger. For 
the 3D salt dome reconstruction, one could argue that the fi-DT method does well because it has good 
directional sensitivity and is therefore able to adapt to the "curvy" nature of the outline of the salt 
body, as opposed to -^i-Haar and £i-D4 methods. The I2-A method does well because the Gaussian 
background that is present in the model is smooth 'noise' and this is exactly the prior information put 
into the minimization criterion. The TV method does well as the main part of the salt dome model 
is roughly piecewise constant and TV favors that. The £q methods do not perform particularly well, 
both from a quantitative as a qualitative side. 

The wavelets, however, do have the distinctive quality of retaining sharp features even when 
regularizing by penalizing highly oscillatory models. If the preservation of sharp boundaries is not as 
important as the correct estimation of amplitudes, the smoothed solution, using the method, is to 
be preferred as it is fully linear and efficient to solve with conjugate gradients. Methods using wavelets 
with small support, however, are able to retain sharp features, despite their regularization effect that 
penalizes highly oscillatory models. These methods are thus preferable when edges are important; our 
preference would go to the £i-D4 algorithm which gives less blocky solutions than £i-Haar. In no cases 
should one use simple norm damping (£2 method). Without imposing additional smoothing, i.e. while 
still allowing for sharp transitions, the l\ methods yield models which do not show signs of noise. 

The £q methods, which use hard-thresholding of wavelet coefficients rather than soft-thresholding, 
cannot outperform the l\ methods. In some cases they appear to produce severe artifacts. Another 
reason not to favor £q penalties is that they only produce a local minimum of the functional (|19p . This 
may lead to larger variability in the reconstructions (depending on the starting point of the iteration). 
There is currently no proven technique to tackle the minimization more efficiently that algorithm (|17p . 
The (unproven) method (|20p proposed in this paper is, as far as the authors can tell, new. 

We conclude that using hard-thresholding is less appealing than using soft-thresholding of wavelet 
coefficients: the mathematical theory is less developed, the hard-thresholded reconstruction may ex- 
hibit significant artifacts and the reconstructions are not better than the ones obtained with soft- 
thresholding (£\ method). 

Speed-wise the nonlinear methods cannot do better than the conjugate gradient algorithm for the 
£2 methods. Many applied mathematics groups [UU EH EJ H21 123 113 EI] are currently working on 
speeding up the iterative soft-thresholding algorithm (|lip . but it is still at least as time-consuming 
to use the £\ norm as it is to use the £2 norm for penalization, especially for severely ill-conditioned 
matrices and low noise conditions [45j . 

In case the data is heavily contaminated by noise, it follows from relation (|24p that a large value of 
the penalty parameter /i must be chosen. In [15] it was demonstrated that many competing algorithms 
for minimizing an l\ penalized functional converge quickly in such a case. We therefore expect that 
such methods remain competitive with the traditional £2 smoothing methods in case of travel time 
seismic tomography where the data noise level may reach 50%. 

Based on the results in this paper, we can conclude that the nonlinear methods offer a way to 
invert data and denoise the resulting model in a single procedure without necessarily smoothing the 
model too much. The two salt dome examples also show that a good reconstruction, clearly showing 
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the characteristic effects of the penalizations used, is still possible with a very limited number of 
iterations: This is a consequence of the FISTA algorithm producing sparse models at every iteration 
step. Sparse models can therefore be constructed with few iterations and little computer time (see 
[46J for a discussion of the number of iterations used as a regularization parameter) . 

As an alternative to D4 or complex DT wavelets one could consider using curvelets or shearlets, as 
they are naturally designed to sparsely represent singularities along smooth curves, such as, e.g., the 
sediment salt interface in our model. In this work we have not studied how the different regularization 
methods behave in conjunction with these particular choices of dictionaries. 
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